This files is a summary of the work made by GMM5 students on data coming from ENVT study.

The data are processed using the following packages:

library(mixOmics)
library(metagenomeSeq)
library(reshape2)

Data description

Data are given in two files included in the directory data :

df_abundances <- read.delim("../data/abondances.csv", sep = ",", 
                            stringsAsFactors = FALSE)
summary(df_abundances[ ,1:15])
##  X.blast_taxonomy   blast_subject      blast_perc_identity
##  Length:406         Length:406         Min.   : 94.18     
##  Class :character   Class :character   1st Qu.: 99.39     
##  Mode  :character   Mode  :character   Median : 99.80     
##                                        Mean   : 99.32     
##                                        3rd Qu.:100.00     
##                                        Max.   :100.00     
##  blast_perc_query_coverage  blast_evalue blast_aln_length
##  Min.   : 98.13            Min.   :0     Min.   :430.0   
##  1st Qu.:100.00            1st Qu.:0     1st Qu.:466.2   
##  Median :100.00            Median :0     Median :480.0   
##  Mean   : 99.96            Mean   :0     Mean   :477.4   
##  3rd Qu.:100.00            3rd Qu.:0     3rd Qu.:490.0   
##  Max.   :100.00            Max.   :0     Max.   :512.0   
##    seed_id          observation_name   observation_sum   
##  Length:406         Length:406         Min.   :   224.0  
##  Class :character   Class :character   1st Qu.:   588.5  
##  Mode  :character   Mode  :character   Median :   979.5  
##                                        Mean   :  9753.7  
##                                        3rd Qu.:  2738.0  
##                                        Max.   :880523.0  
##  NG.10214_EN10A_lib136338_4869_1 NG.10214_EN10B_lib136339_4869_1
##  Min.   :    0.0                 Min.   :    0.0                
##  1st Qu.:    0.0                 1st Qu.:    0.0                
##  Median :    0.0                 Median :    0.0                
##  Mean   :  108.4                 Mean   :  108.4                
##  3rd Qu.:   12.0                 3rd Qu.:   12.0                
##  Max.   :17044.0                 Max.   :17663.0                
##  NG.10214_EN11A_lib136340_4869_1 NG.10214_EN11B_lib136341_4869_1
##  Min.   :    0.0                 Min.   :    0.0                
##  1st Qu.:    0.0                 1st Qu.:    0.0                
##  Median :    1.0                 Median :    1.0                
##  Mean   :  108.4                 Mean   :  108.4                
##  3rd Qu.:    6.0                 3rd Qu.:    6.0                
##  Max.   :24223.0                 Max.   :23739.0                
##  NG.10214_EN15A_lib136342_4869_1 NG.10214_EN15B_lib136343_4869_1
##  Min.   :    0.00                Min.   :    0.00               
##  1st Qu.:    0.00                1st Qu.:    0.00               
##  Median :    3.00                Median :    4.00               
##  Mean   :  108.37                Mean   :  108.37               
##  3rd Qu.:   15.75                3rd Qu.:   15.75               
##  Max.   :28421.00                Max.   :27946.00

The first 9 columns contain information about the different bacteria identified within samples. The following columns contain the two technical replicates (identified by “A” and “B”) of all the farms involved in the study (the farm is identified by a number preceeding the letter “A”/“B”).

Note that some “blast taxonomy” are duplicated:

unique(names(which(table(df_abundances[ ,1]) > 1)))
##  [1] "Bacteria;Actinobacteria;Actinobacteria;Corynebacteriales;Corynebacteriaceae;Corynebacterium 1;Corynebacterium sp."     
##  [2] "Bacteria;Actinobacteria;Actinobacteria;Corynebacteriales;Corynebacteriaceae;Corynebacterium 1;unknown species"         
##  [3] "Bacteria;Actinobacteria;Actinobacteria;Corynebacteriales;Corynebacteriaceae;Corynebacterium;unknown species"           
##  [4] "Bacteria;Actinobacteria;Actinobacteria;Corynebacteriales;Dietziaceae;Dietzia;Dietzia sp."                              
##  [5] "Bacteria;Actinobacteria;Actinobacteria;Corynebacteriales;Nocardiaceae;Rhodococcus;Rhodococcus sp."                     
##  [6] "Bacteria;Actinobacteria;Actinobacteria;Micrococcales;Brevibacteriaceae;Brevibacterium;Brevibacterium antiquum"         
##  [7] "Bacteria;Actinobacteria;Actinobacteria;Micrococcales;Brevibacteriaceae;Brevibacterium;unknown species"                 
##  [8] "Bacteria;Actinobacteria;Actinobacteria;Micrococcales;Intrasporangiaceae;Janibacter;unknown species"                    
##  [9] "Bacteria;Actinobacteria;Actinobacteria;Micrococcales;Microbacteriaceae;Leucobacter;unknown species"                    
## [10] "Bacteria;Actinobacteria;Actinobacteria;Micrococcales;Micrococcaceae;Arthrobacter;Arthrobacter sp."                     
## [11] "Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces sp."               
## [12] "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Bacteroidaceae;Bacteroides;unknown species"                           
## [13] "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Porphyromonadaceae;Petrimonas;unknown species"                        
## [14] "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Porphyromonadaceae;Porphyromonas;unknown species"                     
## [15] "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Porphyromonadaceae;Proteiniphilum;unknown species"                    
## [16] "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Prevotellaceae;Alloprevotella;unknown species"                        
## [17] "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Prevotellaceae;Prevotella 1;unknown species"                          
## [18] "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Prevotellaceae;Prevotella 2;unknown species"                          
## [19] "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Prevotellaceae;Prevotella 9;unknown species"                          
## [20] "Bacteria;Bacteroidetes;Bacteroidia;Bacteroidales;Rikenellaceae;Rikenellaceae RC9 gut group;unknown species"            
## [21] "Bacteria;Bacteroidetes;Flavobacteriia;Flavobacteriales;Flavobacteriaceae;Bergeyella;unknown species"                   
## [22] "Bacteria;Bacteroidetes;Flavobacteriia;Flavobacteriales;Flavobacteriaceae;Flavobacterium;unknown species"               
## [23] "Bacteria;Bacteroidetes;Sphingobacteriia;Sphingobacteriales;Sphingobacteriaceae;Olivibacter;unknown species"            
## [24] "Bacteria;Bacteroidetes;Sphingobacteriia;Sphingobacteriales;Sphingobacteriaceae;Pedobacter;unknown species"             
## [25] "Bacteria;Firmicutes;Bacilli;Bacillales;Planococcaceae;Caryophanon;Caryophanon latum"                                   
## [26] "Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Jeotgalicoccus;unknown species"                               
## [27] "Bacteria;Firmicutes;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus;Staphylococcus sp."                            
## [28] "Bacteria;Firmicutes;Bacilli;Lactobacillales;Aerococcaceae;Facklamia;unknown species"                                   
## [29] "Bacteria;Firmicutes;Bacilli;Lactobacillales;Aerococcaceae;unknown genus;unknown species"                               
## [30] "Bacteria;Firmicutes;Bacilli;Lactobacillales;Carnobacteriaceae;Trichococcus;unknown species"                            
## [31] "Bacteria;Firmicutes;Bacilli;Lactobacillales;Carnobacteriaceae;unknown genus;unknown species"                           
## [32] "Bacteria;Firmicutes;Bacilli;Lactobacillales;Enterococcaceae;Enterococcus;unknown species"                              
## [33] "Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;Streptococcus pluranimalium"                
## [34] "Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;unknown species"                            
## [35] "Bacteria;Firmicutes;Clostridia;Clostridiales;Clostridiaceae 1;Clostridium sensu stricto 1;unknown species"             
## [36] "Bacteria;Firmicutes;Clostridia;Clostridiales;Family XI;Peptoniphilus;unknown species"                                  
## [37] "Bacteria;Firmicutes;Clostridia;Clostridiales;Lachnospiraceae;Blautia;unknown species"                                  
## [38] "Bacteria;Firmicutes;Clostridia;Clostridiales;Lachnospiraceae;Lachnoclostridium;unknown species"                        
## [39] "Bacteria;Firmicutes;Clostridia;Clostridiales;Lachnospiraceae;Lachnospiraceae NK3A20 group;unknown species"             
## [40] "Bacteria;Firmicutes;Clostridia;Clostridiales;Peptostreptococcaceae;Intestinibacter;unknown species"                    
## [41] "Bacteria;Firmicutes;Clostridia;Clostridiales;Peptostreptococcaceae;Peptoclostridium;unknown species"                   
## [42] "Bacteria;Firmicutes;Clostridia;Clostridiales;Ruminococcaceae;Faecalibacterium;unknown species"                         
## [43] "Bacteria;Firmicutes;Clostridia;Clostridiales;Ruminococcaceae;Ruminococcaceae UCG-005;unknown species"                  
## [44] "Bacteria;Firmicutes;Erysipelotrichia;Erysipelotrichales;Erysipelotrichaceae;Erysipelothrix;unknown species"            
## [45] "Bacteria;Fusobacteria;Fusobacteriia;Fusobacteriales;Leptotrichiaceae;unknown genus;unknown species"                    
## [46] "Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Hyphomicrobiaceae;Devosia;unknown species"                     
## [47] "Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Methylobacteriaceae;Methylobacterium;Methylobacterium sp."     
## [48] "Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Phyllobacteriaceae;Aquamicrobium;unknown species"              
## [49] "Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Rhizobiaceae;Rhizobium;Agrobacterium tumefaciens"              
## [50] "Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Alcaligenaceae;Alcaligenes;Alcaligenes faecalis"            
## [51] "Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Comamonas;unknown species"                   
## [52] "Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Oxalobacteraceae;Massilia;unknown species"                  
## [53] "Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Escherichia-Shigella;Escherichia coli"
## [54] "Bacteria;Proteobacteria;Gammaproteobacteria;Pasteurellales;Pasteurellaceae;Mannheimia;Mannheimia sp."                  
## [55] "Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Moraxellaceae;Moraxella;Moraxella oblonga"                 
## [56] "Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Moraxellaceae;Moraxella;unknown species"                   
## [57] "Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas sp."              
## [58] "Bacteria;Proteobacteria;Gammaproteobacteria;Xanthomonadales;Xanthomonadaceae;Luteimonas;unknown species"               
## [59] "Bacteria;Proteobacteria;Gammaproteobacteria;Xanthomonadales;Xanthomonadaceae;Stenotrophomonas;Stenotrophomonas sp."    
## [60] "Bacteria;Proteobacteria;Gammaproteobacteria;Xanthomonadales;Xanthomonadaceae;Stenotrophomonas;unknown species"         
## [61] "Bacteria;Tenericutes;Mollicutes;Mycoplasmatales;Mycoplasmataceae;Mycoplasma;Mycoplasma bovirhinis"

as well as some of the bast id themselves

unique(names(which(table(df_abundances[ ,2]) > 1)))
##  [1] "AB680687.1.1434" "AB681778.1.1454" "AF224286.1.1534"
##  [4] "AJ491302.1.1509" "AM183097.1.1441" "AY243344.1.1501"
##  [7] "AY725811.1.1474" "FJ672272.1.1408" "FJ674571.1.1375"
## [10] "FJ674662.1.1457" "GQ358834.1.1513" "GQ358846.1.1455"
## [13] "HM325988.1.1341" "HM327870.1.1341" "JX986968.1.1511"
## [16] "KC894531.1.1498"

We need to know what to do with those…?

Also, species names are extracted (last not unknown name) by

species <- sapply(df_abundances[ ,1], function(aname) 
  unlist(strsplit(aname, ";")))
species <- sapply(species, function(avect) {
  find_unknown <- grep("unknown", avect)
  if (length(find_unknown) > 0) {
    return(avect[-find_unknown])
  } else return(avect)
})
species <- unlist(sapply(species, function(avect) avect[length(avect)]))
species <- unname(species)

#l'esp?ce c'est le dernier mot de chaque ligne avec les bact?ries

Microbiote analysis

Preprocessing and normalization

First, the two technical replicates are merged (simple sums as the counts have already been normalized to identical library sizes)

abundances <- df_abundances[ ,grep("A_", colnames(df_abundances))] +
  df_abundances[ ,grep("B_", colnames(df_abundances))]
dim(abundances)
## [1] 406  45

which leads to 45 samples (columns) in which 406 bacteria have been observed (rows).

Condition (“EN” or “LBA”) is extracted from column names:

condition <- rep("LBA", ncol(abundances))
condition[grep("EN", colnames(abundances))] <- "EN"
table(condition)
## condition
##  EN LBA 
##  22  23

Also, the farm identifier is extracted from column names:

id_abundances <- as.character(colnames(abundances))
id_abundances <- sapply(id_abundances, function(ac) 
  substr(ac, nchar(ac) - 19, nchar(ac) - 18))
id_abundances <- gsub("A", "0", id_abundances)
id_abundances <- gsub("N", "0", id_abundances)
table(id_abundances)
## id_abundances
## 01 03 04 06 07 09 10 11 15 16 17 18 19 20 21 23 24 25 26 28 29 30 31 
##  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  1  2  2

All but one farm (idenfier: 29) have been sampled twice, once for each condition.

Exploratory analysis: distribution of one sample

The effect of different normalization is first explored by analyzing the distributions of the counts in the first sample before and after normalization. Distribution before normalization is provided as:

df <- abundances
names(df) <- paste0("sample", 1:ncol(df))
ggplot(df, aes(x = sample1 +1)) + geom_histogram(bins = 50) + scale_y_log10() +
  theme_bw() + xlab("counts (sample 1)") + ggtitle("Sample 1 distribution")
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 41 rows containing missing values (geom_bar).

and with a log-transformation by

ggplot(df, aes(x = sample1 + 1)) + geom_histogram(bins = 50) +
  theme_bw() + xlab("counts (sample 1)") + scale_x_log10() +
  ggtitle("Sample 1 distribution (log scale)")

It is commun in metagenomic datasets to perform TSS (Total Sum Scaling) before further normalization. TSS transformation computes relative abundances: \[ y_{ij} = \frac{n_{ij}}{\sum_{k=1}^p n_{ik}} \] for \(n_{ij}\) the counts of species \(j\) in sample \(i\), \(p\) the number of species and \(n\) the number of individuals.

abundances_TSS <- apply(abundances, 1, function(asample)
  asample / sum(asample))
df <- as.data.frame(t(abundances_TSS))
names(df) <- paste0("sample", 1:ncol(df))
ggplot(df, aes(x = sample1 + 1)) + geom_histogram(bins = 50) +
  theme_bw() + xlab("relative abundance (sample 1)") + scale_x_log10()

The next two histograms are based on the normalized counts with:

  • CLR (Centered Log Ratio) transformation: \[ \tilde{y}_{ij} = \log \frac{y_{ij}}{\sqrt[p]{\prod_{k=1}^p y_{ik}}}. \]
abundances_CLR <- logratio.transfo(abundances_TSS, logratio = "CLR", 
                                   offset = 1)
class(abundances_CLR) <- "matrix"
df <- data.frame(t(abundances_CLR))   # on transpose pour avoir une matrice 406*45 comme les autres
names(df) <- paste0("sample", 1:ncol(df))
ggplot(df, aes(x = sample1)) + geom_histogram(bins = 50) + theme_bw() + 
  xlab("counts (sample 1)")

  • ILR (Isometric Log Ratio) transformation \[ \tilde{\mathbf{Y}}' = \tilde{\mathbf{Y}} \times \mathbf{V} \] for \(\tilde{\mathbf{Y}}\) the matrix of CLR transformed data and a given matrix \(\mathbf{V}\) with \(p\) rows and \(p-1\) columns such that \(\mathbf{V} \mathbf{V}^\top = \mathbb{I}_{p-1}\) and \(\mathbf{V}^\top \mathbf{V} = \mathbb{I} + a \mathbf{1}\), \(a\) being any positive number and \(\mathbf{1}\) a vector full of 1.
abundances_ILR <- logratio.transfo(abundances_TSS, logratio = "ILR", 
                                   offset = 1)
class(abundances_ILR) <- "matrix"
df <- data.frame(t(abundances_ILR))  # on transpose pour avoir une matrice 406*45 comme les autres
names(df) <- paste0("sample", 1:ncol(df))
ggplot(df, aes(x = sample1)) + geom_histogram(bins = 50) + theme_bw() + 
  xlab("counts (sample 1)")

  • CSS transformation, which is an adaptative extension for metagenomic data of the quantile normalization used in microarray expression datasets. It is designed so as to account for technical differences between samples.
abundances_CSS <- newMRexperiment(abundances)
abundances_CSS <- cumNorm(abundances_CSS)
## Default value being used.
df <- data.frame(MRcounts(abundances_CSS))
names(df) <- paste0("sample", 1:ncol(df))
ggplot(df, aes(x = sample1 + 1)) + geom_histogram(bins = 50) + theme_bw() + 
  xlab("counts (sample 1)") + scale_y_log10() + scale_x_log10()
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 17 rows containing missing values (geom_bar).

The less asymetric distribution seems to be the one obtained with the CLR transformation and the log-transformed CSS.

Exploratory analysis: distribution of all samples

Distributions of all samples according to the type of transformation and the sample is provided below:

df_log <- log10(abundances + 1)
names(df_log) <- paste0("Sample", 1:ncol(df_log))
df_log <- melt(df_log)
## No id variables; using all as measure variables
df_CLR <- data.frame(t(abundances_CLR))
names(df_CLR) <- paste0("Sample", 1:ncol(df_CLR))
df_CLR <- melt(df_CLR)
## No id variables; using all as measure variables
df_ILR <- data.frame(t(abundances_ILR))
names(df_ILR) <- paste0("Sample", 1:ncol(df_ILR))
df_ILR <- melt(df_ILR)
## No id variables; using all as measure variables
df_CSS <- data.frame(log(MRcounts(abundances_CSS)) + 1)
names(df_CSS) <- paste0("Sample", 1:ncol(df_CSS))
df_CSS <- melt(df_CSS)
## No id variables; using all as measure variables
all_sizes <- c(nrow(df_log), nrow(df_CLR), nrow(df_ILR), nrow(df_CSS))
df <- data.frame(rbind(df_log, df_CLR, df_ILR, df_CSS),
                 "type" = rep(c("log", "CLR", "ILR", "log-CSS"), all_sizes))

#on transforme notre matrice abundances en vecteur, en gros pour le veau 1 - premier valeur, veau 1 - 2?me valeur et ainsi de suite

ggplot(df, aes(x = variable, y = value)) + geom_boxplot() + theme_bw() +
  facet_wrap(~ type, scales = "free_y") + xlab("samples") +
  theme(axis.text.x = element_blank())
## Warning: Removed 10491 rows containing non-finite values (stat_boxplot).

# echelle dif?rente pour chaque boxplot

Exploratory analysis: PCA

A first exploratory analysis is performed with PCA on (merged) raw counts with log transformation:

pca_raw <- pca(log(t(abundances) + 1), ncomp = ncol(abundances), 
               logratio = 'none')
plot(pca_raw)

that shows a good percentage of explained variance for the first axis.

Projection of the individuals on the first two PCs also shows a good separation between the two conditions:

plotIndiv(pca_raw, 
          comp = c(1,2),
          pch = 16, 
          ind.names = FALSE, 
          group = condition, 
          col.per.group = color.mixo(1:2),
          legend = TRUE)

The same analysis is used with TSS normalized counts subsequently transformed by CLR or ILR (which is the expected analysis):

pca_CLR <- pca(abundances_TSS + 1, ncomp = nrow(abundances_TSS),
               logratio = 'CLR')
plot(pca_CLR)

plotIndiv(pca_CLR, 
          comp = c(1,2),
          pch = 16, 
          ind.names = FALSE, 
          group = condition, 
          col.per.group = color.mixo(1:2),
          legend = TRUE)

pca_ILR <- pca(abundances_TSS + 1, ncomp = nrow(abundances_TSS) - 1,
               logratio = 'ILR')
plot(pca_ILR)

plotIndiv(pca_ILR, 
          comp = c(1,2),
          pch = 16, 
          ind.names = FALSE, 
          group = condition, 
          col.per.group = color.mixo(1:2),
          legend = TRUE)

log_CSS <- log(MRcounts(abundances_CSS) + 1)
pca_CSS <- pca(t(log_CSS), ncomp = ncol(log_CSS))
plot(pca_CSS)

plotIndiv(pca_CSS, 
          comp = c(1,2),
          pch = 16, 
          ind.names = FALSE, 
          group = condition, 
          col.per.group = color.mixo(1:2),
          legend = TRUE)

Differences between the two types of samples

A first PLS-DA is computed (with 10-fold CV) to check the efficiency of the method and which type of distance to use in its computation.

with log transformation

clean_log <- data.frame(log(t(abundances[ ,id_abundances != "29"]) + 1))
names(clean_log) <- paste0(species, 1:length(species))
clean_condition <- factor(condition[id_abundances != "29"])
clean_id <- factor(id_abundances[id_abundances != "29"])

set.seed(11)
res_plsda <- plsda(clean_log, clean_condition, ncomp = nlevels(clean_condition))
res_perf <- perf(res_plsda, validation = 'Mfold', folds = 5,
                  progressBar = FALSE, nrepeat = 20)
plot(res_perf, overlay = 'measure', sd = TRUE)

plotIndiv(res_plsda , comp = c(1, 2), ind.names = FALSE, ellipse = TRUE, 
          legend = TRUE, title = 'PLS-DA Comp 1 - 2')

PLS-DA shows a good separation between the two groups and indicates that the Mahalanobis distance provides the lower overall classification error.

Then, sparse PLS-DA is used (with the multilevel approach) to check which number of components to select.

set.seed(33)
res_plsda <- tune.splsda(clean_log, clean_condition, 
                         ncomp = nlevels(clean_condition),
                         multilevel = clean_id,
                         test.keepX = c(seq(5, 200, 5)), validation = 'Mfold', 
                         folds = 10, dist = 'mahalanobis.dist', nrepeat = 10,
                         progressBar = F)
## Splitting the variation for 1 level factor.
plot(res_plsda)

sel_keepX <- res_plsda$choice.keepX[1:2]
sel_keepX
## comp1 comp2 
##     5     5

Finally sparse PLS-DA is performed and the variables explaining the two types of samples are obtained:

res_splsda <- splsda(clean_log, clean_condition, 
                     ncomp = nlevels(clean_condition), multilevel = clean_id,
                     keepX = sel_keepX)
## Splitting the variation for 1 level factor.
plotIndiv(res_splsda, comp = c(1,2), ind.names = FALSE, ellipse = TRUE, 
          legend = TRUE, title = 'sPLS-DA Comp 1 - 2')

head(selectVar(res_splsda, comp = 1)$value)
##                                    value.var
## Corynebacterium 1185             -0.74967488
## Mycoplasma bovoculi M165/69288   -0.43197673
## Corynebacterium 1277             -0.36851309
## Corynebacterium camporealensis74 -0.33540474
## Intestinibacter155               -0.05554666
plotLoadings(res_splsda, comp = 1, method = 'mean', contrib = 'max',
             size.title = 1)

plotLoadings(res_splsda, comp = 2, method = 'mean', contrib = 'max',
             size.title = 1)

with TSS+CLR transformation

clean_CLR <- data.frame(abundances_CLR[id_abundances != "29",])
names(clean_CLR) <- paste0(species, 1:length(species))

set.seed(11)
res_plsda <- plsda(clean_CLR, clean_condition, ncomp = nlevels(clean_condition))
res_perf <- perf(res_plsda, validation = 'Mfold', folds = 5,
                 progressBar = FALSE, nrepeat = 20)
plot(res_perf, overlay = 'measure', sd = TRUE)

plotIndiv(res_plsda , comp = c(1, 2), ind.names = FALSE, ellipse = TRUE, 
          legend = TRUE, title = 'PLS-DA Comp 1 - 2')

set.seed(33)
res_plsda <- tune.splsda(clean_CLR, clean_condition, 
                         ncomp = nlevels(clean_condition),
                         multilevel = clean_id,
                         test.keepX = c(seq(5, 200, 5)), validation = 'Mfold', 
                         folds = 10, dist = 'mahalanobis.dist', nrepeat = 10,
                         progressBar = F)
## Splitting the variation for 1 level factor.
plot(res_plsda)

sel_keepX <- res_plsda$choice.keepX[1:2]
sel_keepX
## comp1 comp2 
##   185    35
res_splsda <- splsda(clean_CLR, clean_condition, 
                     ncomp = nlevels(clean_condition), multilevel = clean_id,
                     keepX = sel_keepX)
## Splitting the variation for 1 level factor.
plotIndiv(res_splsda, comp = c(1,2), ind.names = FALSE, ellipse = TRUE, 
          legend = TRUE, title = 'sPLS-DA Comp 1 - 2')

head(selectVar(res_splsda, comp = 1)$value)
##                                  value.var
## Corynebacterium falsenii383     -0.2107659
## Pseudomonas sp.229               0.2037964
## Providencia317                   0.1982473
## Stenotrophomonas maltophilia202  0.1881363
## Mycoplasma dispar205             0.1761092
## Leptospira broomii209            0.1736302
plotLoadings(res_splsda, comp = 1, method = 'mean', contrib = 'max',
             size.title = 1)

plotLoadings(res_splsda, comp = 2, method = 'mean', contrib = 'max',
             size.title = 1)

Sans Multilevel

with log transformation

set.seed(11)
res_plsda <- plsda(clean_log, clean_condition, ncomp = nlevels(clean_condition))
res_perf <- perf(res_plsda, validation = 'Mfold', folds = 5,
                 progressBar = FALSE, nrepeat = 20)
plot(res_perf, overlay = 'measure', sd = TRUE)

plotIndiv(res_plsda , comp = c(1, 2), ind.names = FALSE, ellipse = TRUE, 
          legend = TRUE, title = 'PLS-DA Comp 1 - 2')

set.seed(33)
res_plsda <- tune.splsda(clean_log, clean_condition, 
                         ncomp = nlevels(clean_condition),
                         test.keepX = c(seq(5, 200, 5)), validation = 'Mfold', 
                         folds = 10, dist = 'mahalanobis.dist', nrepeat = 10,
                         progressBar = F)

plot(res_plsda)

sel_keepX <- res_plsda$choice.keepX[1:2]
sel_keepX
## comp1 comp2 
##    15    75
res_splsda <- splsda(clean_log, clean_condition, 
                     ncomp = nlevels(clean_condition), multilevel = clean_id,
                     keepX = sel_keepX)
## Splitting the variation for 1 level factor.
plotIndiv(res_splsda, comp = c(1,2), ind.names = FALSE, ellipse = TRUE, 
          legend = TRUE, title = 'sPLS-DA Comp 1 - 2')

head(selectVar(res_splsda, comp = 1)$value)
##                                   value.var
## Corynebacterium 1185             -0.4673617
## Mycoplasma bovoculi M165/69288   -0.3854715
## Corynebacterium 1277             -0.3691130
## Corynebacterium camporealensis74 -0.3605789
## Intestinibacter155               -0.2884424
## Peptoclostridium242              -0.2741246
plotLoadings(res_splsda, comp = 1, method = 'mean', contrib = 'max',
             size.title = 1)

plotLoadings(res_splsda, comp = 2, method = 'mean', contrib = 'max',
             size.title = 1)

with TSS+CLR transformation

set.seed(11)
res_plsda <- plsda(clean_CLR, clean_condition, ncomp = nlevels(clean_condition))
res_perf <- perf(res_plsda, validation = 'Mfold', folds = 5,
                 progressBar = FALSE, nrepeat = 20)
plot(res_perf, overlay = 'measure', sd = TRUE)

plotIndiv(res_plsda , comp = c(1, 2), ind.names = FALSE, ellipse = TRUE, 
          legend = TRUE, title = 'PLS-DA Comp 1 - 2')

set.seed(33)
res_plsda <- tune.splsda(clean_CLR, clean_condition, 
                         ncomp = nlevels(clean_condition),
                         test.keepX = c(seq(5, 200, 5)), validation = 'Mfold', 
                         folds = 10, dist = 'mahalanobis.dist', nrepeat = 10,
                         progressBar = F)

plot(res_plsda)

sel_keepX <- res_plsda$choice.keepX[1:2]
sel_keepX
## comp1 comp2 
##     5    75
res_splsda <- splsda(clean_CLR, clean_condition, 
                     ncomp = nlevels(clean_condition), multilevel = clean_id,
                     keepX = c(5,25))
## Splitting the variation for 1 level factor.
plotIndiv(res_splsda, comp = c(1,2), ind.names = FALSE, ellipse = TRUE, 
          legend = TRUE, title = 'sPLS-DA Comp 1 - 2')

head(selectVar(res_splsda, comp = 1)$value)
##                                   value.var
## Corynebacterium falsenii383     -0.66572850
## Pseudomonas sp.229               0.54078805
## Providencia317                   0.44130810
## Stenotrophomonas maltophilia202  0.26005018
## Mycoplasma dispar205             0.04443997
plotLoadings(res_splsda, comp = 1, method = 'mean', contrib = 'max',
             size.title = 1)

plotLoadings(res_splsda, comp = 2, method = 'mean', contrib = 'max',
             size.title = 1)

Fusion des doublons

list_doublons<-unique(names(which(table(df_abundances[ ,1]) > 1)))
fusion_doublons<-as.data.frame(t(sapply(list_doublons,FUN=function(x){as.data.frame(t(colSums(abundances[grep(x,df_abundances[,1]),])))})))
list_pas_doublons <- unique(names(which(table(df_abundances[ ,1]) == 1)))
data <- rbind(abundances[match(list_pas_doublons,df_abundances[,1]),],fusion_doublons)

Random Forest

rownames(data)<-c(list_pas_doublons,list_doublons)
colnames(data) <- id_abundances
data <- data[,-grep("29",id_abundances)]
dim(data)
## [1] 273  44
colnames(data) <- paste0("sample",colnames(data))

df_pathogenes<-read.delim("../data/pathogenes.csv",sep = ",")
pathogenes<-df_pathogenes[,-c(1,9)]
pathogenes<-as.data.frame(ifelse(pathogenes=='p',1,0))
pathogenes<-as.data.frame(apply(t(pathogenes),1,as.factor))
id_pathogenes<-df_pathogenes[,1]
id_pathogenes <- gsub("-","0",id_pathogenes)
id_pathogenes <- gsub(" ","",id_pathogenes)
id_pathogenes <- sapply(as.character(id_pathogenes),FUN=function(x){substr(x,nchar(x)-1,nchar(x))})
id_pathogenes <- id_pathogenes[-grep("29",id_pathogenes)]
id_pathogenes <- paste0(id_pathogenes,"_",rev(condition[-grep("29",id_abundances)]))

Matcher les noms des veaux :

id_abundances_bis <- paste0(id_abundances[-grep("29",id_abundances)],"_",condition[-grep("29",id_abundances)])
Y<-pathogenes[match(id_abundances_bis,id_pathogenes),] 
rownames(Y)<-id_abundances_bis

library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:Biobase':
## 
##     combine
## The following object is masked from 'package:BiocGenerics':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
test <- as.data.frame(t(data))
test <- as.data.frame(sapply(test,as.integer))

colnames(test)<-paste0("var_",1:ncol(test))  # chgt nom des variables pour que RF marche

fit <- randomForest(Y$Ct.Coronavirus ~ ., data = test,ntree=1000,mtry=30)
print(fit$confusion)
##    0  1 class.error
## 0 10 12   0.5454545
## 1 12 10   0.5454545
# variables qui discriminent le mieux:
varImpPlot(fit)

# liste des 10 variables qui discriminent le mieux:
(fit$importance[order(fit$importance[, 1], decreasing = TRUE), ])[1:10]
##    var_80   var_257   var_228   var_259   var_162   var_109     var_4 
## 0.3903818 0.3834907 0.3736368 0.3468617 0.3394121 0.3133421 0.3122722 
##   var_193   var_181   var_271 
## 0.2998659 0.2930991 0.2820829
# graphique montrant comment r?duit l'OOB en fonction du nombre d'arbres g?n?r?s
plot(fit$err.rate[, 1], type = "l", xlab = "nombre d'arbres", ylab = "erreur OOB")

Prédictions obtenues :

cat(fit$predicted)
## 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 2 1 1 2 1 1 2 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1

Session information

sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
## 
## locale:
##  [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8    
##  [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8   
##  [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] randomForest_4.6-12  reshape2_1.4         metagenomeSeq_1.18.0
##  [4] RColorBrewer_1.1-2   glmnet_2.0-13        foreach_1.4.3       
##  [7] Matrix_1.2-11        limma_3.32.10        Biobase_2.36.2      
## [10] BiocGenerics_0.22.1  mixOmics_6.3.0       ggplot2_2.2.1       
## [13] lattice_0.20-35      MASS_7.3-47         
## 
## loaded via a namespace (and not attached):
##  [1] gtools_3.5.0       purrr_0.2.4        colorspace_1.2-4  
##  [4] htmltools_0.3.6    yaml_2.1.14        rlang_0.1.4       
##  [7] glue_1.2.0         bindrcpp_0.2       matrixStats_0.52.2
## [10] plyr_1.8.3         bindr_0.1          stringr_1.2.0     
## [13] munsell_0.4.2      gtable_0.1.2       caTools_1.17.1    
## [16] htmlwidgets_0.9    codetools_0.2-15   evaluate_0.10.1   
## [19] labeling_0.3       knitr_1.17         httpuv_1.3.5      
## [22] rARPACK_0.11-0     Rcpp_0.12.13       KernSmooth_2.23-15
## [25] xtable_1.8-2       corpcor_1.6.9      scales_0.5.0      
## [28] backports_1.1.1    gdata_2.18.0       jsonlite_1.5      
## [31] mime_0.5           RSpectra_0.12-0    gplots_3.0.1      
## [34] gridExtra_2.3      ellipse_0.3-8      digest_0.6.12     
## [37] stringi_1.1.5      dplyr_0.7.4        shiny_1.0.5       
## [40] grid_3.4.1         rprojroot_1.2      bitops_1.0-6      
## [43] tools_3.4.1        magrittr_1.5       rgl_0.98.1        
## [46] lazyeval_0.2.1     tibble_1.3.4       tidyr_0.7.2       
## [49] pkgconfig_2.0.1    assertthat_0.2.0   rmarkdown_1.7     
## [52] iterators_1.0.8    R6_2.2.2           igraph_1.1.2      
## [55] compiler_3.4.1